
Generic HPC Install Script #329

Merged: 71 commits merged into main from GH-191/longleaf-batch-submission on Oct 23, 2024

Conversation


@TimothyWillard TimothyWillard commented Oct 2, 2024

Describe your changes.

This adds a generic `hpc_install.sh` script that can reproducibly set up and install flepiMoP on both rockfish and longleaf. Roughly, the script does the following (a sketch of the flow follows the list below):

  1. Figures out HPC-specific variables and modules; currently only rockfish and longleaf are supported.
  2. Loads sensitive credentials.
  3. Sets up a FLEPI_PATH environment variable.
  4. Sets up or updates a conda environment.
  5. Ensures that the R and Python versions of arrow are compatible. These checks are loose and not definitive.
  6. Installs custom R packages.
  7. Sets up environment variables commonly used with flepiMoP.
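
A rough sketch of that flow, with hypothetical variable and module names where noted (the real script is more involved):

    # Hypothetical skeleton of the flow above, not the actual script
    set -e                                    # stop on the first failure
    CLUSTER="$1"                              # "rockfish" or "longleaf"
    if [[ $CLUSTER == "rockfish" ]]; then
        USERDIR="/scratch4/struelo1/flepimop-code/$USER/"
        module load gcc/9.3.0 git git-lfs slurm anaconda3/2022.05
    elif [[ $CLUSTER == "longleaf" ]]; then
        USERDIR="/users/${USER:0:1}/${USER:1:1}/$USER/"   # path layout inferred from the examples below
        module load git anaconda                          # module names assumed
    fi
    source "${USERDIR}slack_credentials.sh"               # step 2: sensitive credentials
    export FLEPI_PATH="${FLEPI_PATH:-${USERDIR}flepiMoP}" # step 3: respect a pre-set FLEPI_PATH
    conda env update --name flepimop-env \
        --file "$FLEPI_PATH/environment.yml"              # step 4: create/update the env (file name assumed)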

I'm going to add a separate PR for documentation, since that needs to be merged into gitbook-documentation. Usage on rockfish would look something like:

wget https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/GH-191/longleaf-batch-submission/build/hpc_install.sh
vim /scratch4/struelo1/flepimop-code/ext-twillard/slack_credentials.sh
chmod 600 /scratch4/struelo1/flepimop-code/ext-twillard/slack_credentials.sh
source hpc_install.sh rockfish

and on longleaf:

wget https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/GH-191/longleaf-batch-submission/build/hpc_install.sh
vim /users/t/w/twillard/slack_credentials.sh
chmod 600 /users/t/w/twillard/slack_credentials.sh
source hpc_install.sh longleaf

In each case, replace the URL with the appropriate one. Then keeping your environment up to date is as easy as:

source /scratch4/struelo1/flepimop-code/ext-twillard/flepiMoP/build/hpc_install.sh rockfish

on rockfish or:

source /users/t/w/twillard/flepiMoP/build/hpc_install.sh longleaf

on longleaf.

A big open question is how best to install packages. Right now the script installs gempyor, flepiconfig, flepicommon, and inference from GitHub rather than locally, whereas I think installing locally would be preferred for dev reasons, at least in the meantime.

What does your pull request address? Tag relevant issues.

One of many steps required for GH-191. Should resolve GH-308.

Tag relevant team members.

@pearsonca, @shauntruelove, @MacdonaldJoshuaCaleb

Edit: fixed the wget URL to use the "raw" file instead of the pretty version.

Heavily inspired by the original `batch/slurm_init.sh` script. The init
script is a run-once script that takes care of installing dependencies
and setup, whereas prerun sets the env vars needed per run.
Initial version of the HPC install script, somewhat inspired by the
slurm init script.
* Changed how the R arrow version is formatted for readability.
* Changed the final output command to print diagnostic info correctly.
Added slurm's --partition flag to the `batch/inference_job_launcher.py`
script for usage on UNC's Longleaf cluster.
The longleaf-specific init/pre-run scripts are now superseded by the
generic `build/hpc_install.sh` script.
Removed the --partition flag for the slurm partition from the
inference job launcher script. This will be handled in a new
flepiscripts script.
@pearsonca pearsonca left a comment

Looks generally good, but a few questions to address.

elif [[ $1 == "rockfish" ]]; then
    # Setup general purpose user variables needed for RockFish
    USERDIR="/scratch4/struelo1/flepimop-code/$USER/"

Contributor

need to cd to USERDIR as well here?

Contributor

and if we do, several of the $USERDIRs below can/must be eliminated
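
A sketch of that suggestion (illustrative only):

    mkdir -p "$USERDIR"
    cd "$USERDIR"
    # later paths can then be relative instead of prefixed with $USERDIR, e.g.:
    git clone git@github.com:HopkinsIDD/flepiMoP.git   # rather than cloning into "$USERDIR/flepiMoP"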

Contributor

Could add creating some HPC-wide environment variables to the longleaf-setup repo. Does that make sense to pair with this?

Lastly ... bit weird that we're doing the install here in scratch. Why not in $HOME? I get doing projects on scratch.

Contributor

per in-person conversation:

  • need to check the preferred location for libraries on longleaf & rockfish
  • maybe refer to that as $LIBDIR (or ACCLIBDIR or some such)
  • might want to move that as a generic variable to be set on the HPC, and if so - move that to the longleaf-setup directions (which could itself stand to be scriptified) and make that setup a prerequisite to this? (one downside to that would be other people on other HPCs wanting to use / modify this script - future problem?)

Contributor Author

At least on the longleaf side it looks like /users is similar to $HOME, the documentation states "Think of it as a capacity expansion to your home directory." However, I think maybe the project directory should be moved to /work since that's high throughput and designed for active jobs. So my take is:

  • flepiMoP and flepimop-env stay in /users, especially for the conda env since that directory can get large and $HOME has some low and strict storage caps.
  • Move the project directory to /work since that'll actually need throughput for the job.

I still need to dig up the rockfish documentation. Longleaf docs: https://help.rc.unc.edu/getting-started-on-longleaf/#main-directory-spaces.

Contributor

Thanks for these scripts - the install all worked great for me on longleaf. 🎉

I'm also open to having these things anywhere, but as @jcblemai said, I think having everything (including the flepimop libraries) in /work or /scratch makes the most sense, including the flepiMoP folder itself. I understand that installing these in /users or /home would be ideal if flepiMoP were stable, but from a practical perspective I am operationally often changing things within flepimop and reinstalling things run-to-run, playing with my own different environments, jumping between branches, or jumping between different FLEPI_PATHs (not ideal, but practically this is just what we've had to do with concurrent runs and changes). So for convenience it would be good to just have everything in the same place, imo.

Separately, from my experience running stuff in the past, I was confused by having to link the specific location of the flepimop-env. I'm fine either way, I just don't think I follow why the change was made.

Contributor

@saraloo is there a general class of the things you're changing?

@TimothyWillard TimothyWillard commented Oct 18, 2024

I would put everything in /scratch/ on rockfish (as per the current doc)

This has been the case since 9ca12ed. I see that this was not the case for the $USERDIR variable; that's done now as well.

and in /work/ on longleaf, for both convenience and speed.

This is done now. I think I may be misinterpreting the docs (see https://help.rc.unc.edu/getting-started-on-longleaf/#main-directory-spaces) on the differences between /users and /work. @jcblemai, what are the practical differences between the two? My interpretation was that /work is meant for high-IO, short-term storage for active work, whereas /users is designed for longer-term, lower-IO (reads okay?) storage for libraries/codebases.

I am operationally often changing things within flepimop and reinstalling things run-to-run, playing with my own different environments, jumping between branches, or jumping between different FLEPI_PATH

@saraloo is this normal operational behavior? If so, the installation script needs to be much more accommodating of flexibility. For the different environments, do you mean switching between multiple conda envs? What makes each of these envs distinct? As for jumping branches, this script won't do anything to your flepiMoP clone; you can switch the branch yourself and then run this script again to update the conda env with the code from that branch ("install" is a misnomer; it really should be "install or update", so I'll change the script name and make sure this is clear when writing the documentation). Does that accommodate this use case? As for different $FLEPI_PATHs, this script checks whether that env var is set before doing anything, and if it is, it just uses the set value, so there should be no issue setting custom $FLEPI_PATHs. Have you tested this yet, and does it accommodate your use case?
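
For reference, the $FLEPI_PATH guard is roughly (a sketch, not the exact script code):

    # Respect a pre-set FLEPI_PATH; otherwise fall back to a default location
    if [[ -z "$FLEPI_PATH" ]]; then
        export FLEPI_PATH="${USERDIR}flepiMoP"
    fi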

Contributor

Responding to both simultaneously. No, I don't think this is normal behavior, so feel free to make a judgement call on your end. Just in the past, during larger periods of development (which inevitably coincide with operational demands), I was running two or three different diseases on significantly different gempyor and/or R inference setups from different conda environments (again, I don't think this will necessarily be standard, especially now that more people can run stuff). Just flagging that there will be circumstances where flexibility is preferable; I want to reduce the possibility of someone setting the wrong flepimop version they're working on, and reduce having to jump around to switch branches.
And sorry, I haven't tested the FLEPI_PATH bit yet, but that makes sense and I don't anticipate any issues with setting that.

Collaborator

As we move to the new workflow Carl described today, we will have custom branches for runs, so you can envision someone running Flu and RSV from the same account but using two different flepiMoP branches. I think, however, that this flexibility can be added later with the pre-run scripts that are mentioned below.

Sometimes, when running too many parallel runs, we can also hit filesystem locks on the packages, which is always annoying, but I would not worry about it too much.

Re: Sara's question: do we need to specify the location of the conda environment?

@jcblemai what are the practical differences between the two? My interpretation was that /work is meant for high-IO, short-term storage for active work, whereas /users is designed for longer-term, lower-IO (reads okay?) storage for libraries/codebases.

This is correct, but flepiMoP does not support writing to any folder other than the project one, so we work from /work.

build/hpc_install.sh (5 outdated review threads, resolved)
@jcblemai jcblemai commented Oct 2, 2024

That's really great, thank you. Ideally we would have a per-cluster configuration file that would populate some variables like:

  1. where project_dir is
  2. where the final files would go
  3. what pre-processing steps are needed before the base flepiMoP ones.

Then some cluster-agnostic script would run. Item 3 will also be used by the runner script.

We decided to use separate commands in the doc instead of a script so that errors are not silent and are reported as they occur. I would make sure the script exits on failure, maybe using set -e and set -x.
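
For example, a minimal illustration of those flags (not taken from the PR):

    set -e            # exit immediately when any command fails
    set -x            # echo each command before running it, so the failing step is visible
    module load git   # a failed module load now aborts the script instead of failing silently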

rockfish.yml

paths:
  - final_output_path: /scratch4/struelo1/flepimop-runs/
  - project_path: /scratch4/struelo1/flepimop-code/
  - secrets_path: $USER/flepi_secrets.sh

init_commands:
  - module purge
  - module load gcc/9.3.0
  - module load git
  - module load git-lfs
  - module load slurm
  - module load anaconda3/2022.05
  - conda activate flepimop-env

(perhaps it's better if the above is a bash file)
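
A bash equivalent might look like this (hypothetical variable names, directly translating the YAML sketch above):

    # rockfish.sh: per-cluster configuration as a sourceable bash file
    export FINAL_OUTPUT_PATH="/scratch4/struelo1/flepimop-runs/"
    export PROJECT_PATH="/scratch4/struelo1/flepimop-code/"
    export SECRETS_PATH="$USER/flepi_secrets.sh"   # mirrors the YAML sketch verbatim

    module purge
    module load gcc/9.3.0
    module load git
    module load git-lfs
    module load slurm
    module load anaconda3/2022.05
    conda activate flepimop-env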

build/hpc_install.sh (3 outdated review threads, resolved)
* Changed `flepiMoP` git clone to use ssh instead of http to allow for
  edits from HPC.
* Add `set -e` to error clearly on a command failure.
* Install `gempyor` from cloned `flepiMoP` repo directly, yet to do the
  same for R packages.
@jcblemai jcblemai commented Oct 3, 2024

Also, the above comments focus on what's missing, but it's really awesome that this runs on longleaf (it would have been very useful right now, except that I'm running emcee).

@TimothyWillard TimothyWillard force-pushed the GH-191/longleaf-batch-submission branch from 95035a4 to 3be416d on October 4, 2024 16:29
build/setup.R (2 outdated review threads, resolved)
@TimothyWillard TimothyWillard force-pushed the GH-191/longleaf-batch-submission branch 2 times, most recently from e9344cb to b1e6670 on October 7, 2024 14:10
Clean up error handling so the script exits a bit more nicely.
@TimothyWillard TimothyWillard dismissed stale reviews from jcblemai and pearsonca via aa1de4f October 21, 2024 21:24
@TimothyWillard (Contributor Author)

I like installing with pip -e because it means I can update gempyor very fast/switch branch and have the new thing ready.

This has been added now.
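
For reference, an editable install looks like this (the in-repo path is an assumption for illustration):

    # Editable mode: branch switches take effect without reinstalling
    pip install -e "$FLEPI_PATH/flepimop/gempyor_pkg"   # hypothetical path to the gempyor package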

Your first comment about how to use this PR should be added to the doc right ? and replace manual environment creation.

The first comment is a bit out of date now; usage has changed some. However, I plan to replace the current HPC install/update guides in the flepiMoP wiki in a separate PR into the documentation-gitbook branch, and to reference this PR from it.

Conda environment management has changed some from the discussion in #329 (comment), per a Slack discussion with @jcblemai and @saraloo. The script will now use a default environment in ~/.conda rather than an absolute path in the work directories of the clusters.
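
Roughly, the change looks like this (a sketch; the exact commands in the script may differ):

    # Before: environment referenced by an absolute path (hypothetical path shown)
    conda activate /work/users/t/w/twillard/flepimop-env
    # After: environment referenced by name; when the base anaconda module is
    # read-only, as is typical on HPC systems, conda keeps named envs under ~/.conda/envs
    conda create --name flepimop-env python=3.10   # the version pin is illustrative
    conda activate flepimop-env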

@pearsonca (Contributor)

Not sure where @jcblemai's comment went re branches wanted for different operational runs, but working trees seem like an option: https://stackoverflow.com/questions/2048470/git-working-on-two-branches-simultaneously - might be something to do via init script?
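
A worktree setup along those lines might look like this (illustrative branch and directory names):

    cd "$FLEPI_PATH"
    # Check out a second branch in a sibling directory, without a second full clone
    git worktree add ../flepiMoP-flu flu-operational-branch
    git worktree list   # show all checkouts attached to this repository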

Switch from using a conda environment specified by a path to a conda
environment specified by a name assumed to be in `~/.conda`.
@TimothyWillard TimothyWillard force-pushed the GH-191/longleaf-batch-submission branch from aa1de4f to 70ced08 on October 22, 2024 14:07
This option is not compatible with the `--editable` flag.
@TimothyWillard (Contributor Author)

Not sure where @jcblemai's comment went re branches wanted for different operational runs, but working trees seem like an option: https://stackoverflow.com/questions/2048470/git-working-on-two-branches-simultaneously - might be something to do via init script?

This is currently handled in batch/inference_job_launcher.py, which I think is where you want to handle branching for job submission, and which hasn't been touched by this PR yet. Do you want me to make changes to that in this PR as well? Maybe this is something better suited to https://github.com/ACCIDDA/flepiscripts/pull/1?

Looks good to me, provided a small batch run works with it.

After the recent round of edits I was able to submit one of the recent Flu configs to rockfish using an environment set up and initialized with the build/hpc_install_or_update.sh and batch/hpc_init.sh scripts. It was a small submission (2 jobs, 200 iterations), so the actual outputs aren't helpful, but it does demonstrate that the tools provided in this PR can set up an HPC environment suitable for batch submission.

@pearsonca (Contributor)

This is currently handled in batch/inference_job_launcher.py which is I think where you want to handle branching for job submission and hasn't been touched by this PR yet. Do you want me to make changes to that in this PR as well now? Maybe this is something better fit for ACCIDDA/flepiscripts#1?

Let's aim for "next up" on that - I'd like an integration of these scripts w/ the CLI to support flepimop update and flepimop init ... actions.

@TimothyWillard TimothyWillard commented Oct 22, 2024

Let's aim for "next up" on that - I'd like an integration of these scripts w/ the CLI to support flepimop update and flepimop init ... actions.

Sure, can you create an issue for that to move discussion of details there?

jcblemai previously approved these changes Oct 22, 2024
pearsonca previously approved these changes Oct 22, 2024
Required resolving conflicts in `inference`'s `DESCRIPTION` and
`install_cli.R`.
@TimothyWillard TimothyWillard dismissed stale reviews from pearsonca and jcblemai October 22, 2024 19:09

The merge-base changed after approval.

@TimothyWillard TimothyWillard merged commit 24d243c into main Oct 23, 2024
3 checks passed
@TimothyWillard TimothyWillard deleted the GH-191/longleaf-batch-submission branch October 23, 2024 13:23
Labels: batch (Relating to batch processing.), installation (Relating to installation / upgrade / migration.)